Frontiers in Genetics — Latest Matching Preprints

1

A Foundational Exome Resource for Jordan: Dual Ancestry Admixture and Population-Specific Variants to Improve Clinical Variant Interpretation

Froukh, T.

2026-05-27 genetic and genomic medicine 10.64898/2026.05.23.26353895 medRxiv

Top 0.1%

26.0%

Currently, the genetic architecture of Middle Eastern populations is underrepresented in global genomic databases. This gap increases the rate of Variants of Uncertain Significance (VUSs) and clinical misinterpretations of genomic data, especially in Middle Eastern populations. Whole exome sequencing was conducted on 90 healthy individuals from Jordan and the data were analysed using Principal Component Analysis (PCA) and multi-computational filtering. PCA revealed a dual-ancestry (EUR-AFR) admixture rather than a triple admixture (EUR-AFR-AMR). More than 3,500 population-specific variants (PSVs) were identified, of which 72% were singletons. Additionally, 19 variants were significantly enriched compared to the maximum allele frequencies in public global databases (Fisher's exact test with Benjamini-Hochberg false discovery rate correction, p-value < 0.05). Consequently, the results suggest reclassifying the VUSs that reside in the ECE2 gene as likely benign, and the variants with Conflicting Classifications of Pathogenicity in the genes IL1RN and THPO as benign, based on their significantly enriched allele frequency (AF=0.0389, p-value < 0.05). Furthermore, a pathogenic ClinVar variant was identified in a healthy individual, warranting careful interpretation. The findings underscore the importance of identifying PSVs in order to minimize or even prevent clinical misdiagnosis and highlight the unique genetic signature in Jordan. The study serves as a foundational resource for precision medicine in the region.
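The enrichment test named in this abstract (Fisher's exact test followed by Benjamini-Hochberg correction) can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the one-sided alternative, the function names, the reference-database counts and the list of raw p-values are all assumptions. The cohort count of 7 alternate alleles out of 180 is only inferred from AF = 0.0389 in 90 diploid individuals.

```python
from math import comb


def fisher_enrichment_p(alt_cohort, ref_cohort, alt_db, ref_db):
    """One-sided Fisher's exact p-value that the cohort carries more
    alternate alleles than expected, given the 2x2 allele-count table."""
    cohort_total = alt_cohort + ref_cohort
    alt_total = alt_cohort + alt_db
    grand_total = cohort_total + alt_db + ref_db
    denominator = comb(grand_total, cohort_total)
    upper = min(cohort_total, alt_total)
    # Sum hypergeometric probabilities for tables at least as extreme.
    numerator = sum(
        comb(alt_total, k) * comb(grand_total - alt_total, cohort_total - k)
        for k in range(alt_cohort, upper + 1)
    )
    return numerator / denominator


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical variant: 7 of 180 cohort alleles versus 50 of 100,000
# alleles in a reference database (reference counts are made up).
p_variant = fisher_enrichment_p(7, 173, 50, 99950)

# Hypothetical raw p-values for several variants tested together.
raw = [p_variant, 0.01, 0.04, 0.30]
for p, q in zip(raw, benjamini_hochberg(raw)):
    print(f"raw p = {p:.3g}, BH-adjusted p = {q:.3g}")
```

A variant would then be called enriched when its adjusted p-value falls below 0.05, the cutoff quoted in the abstract.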

2

Recurrent LINE 1 exonization drives transcriptome remodelling in NSCLC

Parida, A. S.; Kumar, A.; Tiwari, B.

2026-04-24 genomics 10.64898/2026.04.22.720055 medRxiv

Top 0.1%

22.1%

The only autonomously active transposable elements in the human genome are Long interspersed nuclear element-1 (LINE-1) elements. These elements are known to play an important role in changing the transcriptome. Beyond their established role in retrotransposition, LINE-1 sequences affect gene regulation during post-transcriptional processing. Exonization is one such mechanism, in which a genomic LINE-1 insertion is incorporated through alternative splicing to produce new transcript isoforms. Our work highlights the effect of LINE-1-associated exonization, focusing on the formation of transcript isoforms. Using non-small cell lung cancer (NSCLC) as a model, we conducted a detailed transcriptome study that combines splice junction profiling with gene expression data. Our results show that LINE-1 sequences are often included as exons in host transcripts, leading to the formation of new exons and their various isoforms. The events are supported by solid splice junction evidence, which indicates their reliability and reproducibility. In particular, repeated analyses consistently revealed certain LINE-1 exonization events. This finding indicates that LINE-1 elements act as recurrent sources of splice-ready sequences. Though exonizations do not necessarily affect the total expression levels of genes, our study reveals that they contribute to transcript diversity. The diversity of isoforms generated potentially affects gene function. This study is limited to NSCLC, but it is likely that exonization events play a crucial role in altering RNA diversity in cancers. The study therefore offers new insights into how transposable elements modify gene structure and function during cancer development.

3

Genomic indicators of gene function: A systematic assessment of the human genome

Cooper, H. B.; Rojas Lopez, K. E.; Schiavinato, D.; Black, M. A.; Gardner, P. P.

2026-04-09 genomics 10.64898/2026.04.08.717348 medRxiv

Top 0.1%

18.2%

Proteins and non-coding RNAs are functional products of the genome that are central to crucial cellular processes. With recent technological advances, researchers can sequence genomes in the thousands and probe numerous genomic activities of many species and conditions. Such studies have identified thousands of potential proteins, RNAs and associated activities. However, there are conflicting interpretations of the results and therefore of which regions of the genome are "functional". Here we investigate the relative strengths of associations between coding and non-coding gene functionality and genomic features, by comparing reliably annotated functional genes to non-genic regions of the genome. We find that the strongest and most consistent associations between functional genes and genomic features are with transcriptional activity and evolutionary conservation. We also evaluated sequence-based statistics, genomic repeats, epigenetic and population variation data. Other features strongly associated with function include histone marks, chromatin accessibility, genomic copy-number, and sequence alignment statistics such as coding potential and covariation. We also identify potential issues with SNP annotations in short non-coding RNAs, as some highly conserved ncRNAs have significantly higher than expected SNP densities. Our results demonstrate the importance of evolutionary conservation and transcriptional activity for indicating protein-coding and non-coding gene function. Both should be taken into consideration when differentiating between functional sequences and biological or experimental noise.

4

A Bayesian multidimensional approach to decipher the genetic basis of dynamic phenotypes in multiple species

Blois, L.; Heuclin, B.; Bernard, A.; Denis, M.; Dirlewanger, E.; Foulongne-Oriol, M.; Marullo, P.; Peltier, E.; Quero-Garcia, J.; Marguerit, E.; Gion, J.-M.

2026-04-03 genetics 10.64898/2026.04.01.715770 medRxiv

Top 0.1%

15.0%

Deciphering the genetic architecture of complex quantitative phenotypes remains challenging in quantitative genetics. These traits not only depend on multiple genetic factors but are also established across time and environments. Although quantitative genetics has investigated the genetic determinism of phenotypic plasticity in contrasting environmental conditions, time-related phenotypic plasticity has received less attention. Here we propose a multivariate Bayesian framework, the Bayesian Varying Coefficient Model (BVCM), designed for analysing the genetic architecture of time-related phenotypic plasticity with a multilocus approach. We applied the BVCM to time series phenotypes measured at various time scales (daily, monthly, yearly) across a diverse set of biological species. We included in this study: yeast (Saccharomyces cerevisiae), a filamentous fungus (Fusarium graminearum), eucalyptus (Eucalyptus urophylla x E. grandis), and sweet cherry tree (Prunus avium). The BVCM results were compared with those obtained with a known genome-wide association method carried out time point by time point. For all species and traits, the BVCM was able to detect the major QTL identified by marker-trait association methods and revealed additional genetic regions of weak effect. It also increased the phenotypic variance explained for most of the phenotypes considered. It revealed dynamic QTLs with transitory, increasing or decreasing effects over time. By considering both the temporal and genetic multivariate structures in a single statistical model, we increased our understanding of the genetic architecture of complex traits, notably by reducing the issue of missing heritability. More broadly, this work lays the foundation for extended applications in functional genomics, evolutionary ecology, and crop breeding programs, in which time-related phenotypic plasticity remains crucial for predicting and selecting key quantitative complex traits.
Key message: By capturing the genetic factors influencing time-related phenotypic plasticity, our approach contributes to a deeper understanding of the dynamic nature of genotype-phenotype relationships.

5

A telomere-to-telomere (T2T) pig genome assembly reveals Y chromosome diversity and structural variations of Wuzhishan pigs

Ren, Y.; Wang, F.; Li, X.; Liu, G.; Sun, R.; Zheng, X.; Zhang, Y.; Lin, R.; Lu, X.; Chen, L.; Xin, W.; Fei, Y.; Chao, Z.

2026-04-27 genomics 10.64898/2026.04.23.720499 medRxiv

Top 0.1%

14.9%

Background: Wuzhishan (WZS) pigs are native to Hainan Province of China, and serve as both important agricultural resources and biomedical models. Although the published WZS pig genome (T2T-pig1.0) has achieved telomere-to-telomere (T2T) completeness, substantial genetic diversity still exists within the same pig breed, so another WZS pig genome, named WZS-T2T, was assembled in this study. Results: Multiple types of sequencing data were used to assemble the genome, yielding a ~2.68 Gb telomere-to-telomere assembly with an N50 length of ~142.87 Mb and 23,100 annotated protein-coding genes. Compared with T2T-pig1.0, WZS-T2T had higher QV and BUSCO values and a longer Y chromosome (ChrY). The ChrY of the two WZS pigs shared 11 genes, including the sex differentiation-related genes SHOX, PRKX, and DDX3X, as well as SRY; however, the energy metabolism gene SLC25A4 and the macrophage-related receptor gene CSF2RA on ChrY were specific to WZS-T2T. An inversion SV of ~33.86 Mb on chromosome 10 was identified between the two WZS pigs, and three lines of evidence are presented to support the sequence orientation in WZS-T2T. In the population difference analysis, genetic diversity was consistent with the rate of LD decay. WZS pigs exhibited higher genetic diversity than the other four pig populations examined in this study (Tunchang pigs, Yuxi black pigs, Large White pigs, and Duroc pigs), and presented slower LD decay than those four breeds. Conclusions: WZS-T2T provides a higher-quality assembly, with potential benefits for both agricultural production and biomedical uses of WZS pigs.
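The N50 quoted in this abstract is a standard contiguity statistic: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. A minimal sketch of the calculation follows, using made-up sequence lengths rather than WZS-T2T data; the function name is an assumption for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("at least one sequence length is required")
    total = sum(lengths)
    covered = 0
    # Accumulate from the longest sequence downwards until half the
    # assembly is covered; the current length is then the N50.
    for length in sorted(lengths, reverse=True):
        covered += length
        if covered * 2 >= total:
            return length


# Hypothetical lengths in Mb, not taken from the preprint.
example_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(example_lengths))
```

Because the statistic depends only on the length distribution, a chromosome-scale T2T assembly will have an N50 close to the size of a mid-length chromosome.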

6

Incomplete Dominance of ASIP Alleles in Hungarian Puli Dogs is Associated with MC1R Mutation

Belyakin, S. N.; Maksimov, D. A.; Pobedintseva, M. A.; Laktionov, P. P.; Mikhnevich, N. V.; Sipin, F. A.; Krylova, M. I.

2026-03-19 genetics 10.64898/2026.03.17.712399 medRxiv

Top 0.1%

14.8%

Alleles of the ASIP gene (Agouti locus) in dogs determine a wide spectrum of coat colors, from red to black. The gain-of-function Ay allele is the most dominant among the known ASIP mutations: when all other genes affecting coat pigmentation are intact, the presence of the Ay allele results in a red coat color. The loss-of-function a allele is the most recessive allele of this gene. When homozygous, it gives a black coat color. Usually, dogs with the Ay/a genotype have a red coat, because a single copy of the Ay allele is sufficient to fully compensate for the non-functional a allele, implying complete dominance in this pair of alleles. However, exceptions are known. In the Hungarian Puli breed there is a specific coat pigmentation type called fako. We investigated the genetic composition of fako dogs and found evidence that the dominance of the Ay allele over the a allele may be incomplete in these dogs. Analysis of the MC1R gene, which interacts with ASIP in the hair pigmentation genetic cascade, allowed us to find variants that may be responsible for the incomplete dominance of the Ay allele over the a allele in Hungarian Puli dogs.

7

Clarified an rDNA Gene Unit Pattern with (CTTT)n and (CT)n Microsatellites Aggregation Ahead of and Behind the Gene in Human Genome

Shen, J.; Tang, S.; Xia, Y.; Qin, J.; Xu, H.; Tan, Z.

2026-03-24 genetics 10.64898/2026.03.22.713381 medRxiv

Top 0.1%

14.5%

Background: Conventional models of human ribosomal DNA (rDNA) array organization have historically depended on transcription-centric boundaries, partitioning the unit into a ~13 kb rDNA transcription region and a monolithic ~31 kb intergenic spacer (IGS). While our previous identification of Duplication Segment Units (DSUs) mapped these arrays based on an intuitive analysis of the microsatellite density landscape of the complete reference human genome, our present deep mining of this landscape has revealed a more accurate rDNA Gene Unit Pattern. Methods & Results: In this study, we conducted a deep mining analysis of our previously established microsatellite density landscape of the T2T-CHM13 assembly, focusing specifically on nucleolar organizing regions (NORs). We suggest a more accurate rDNA Gene Unit Pattern containing a (CTTT)n microsatellite aggregation ahead of the rDNA gene and a (CT)n microsatellite aggregation behind the gene, rather than a pattern featuring an IGS region inserted between two rDNA genes. Conclusions: A correct rDNA gene pattern of the human genome probably includes a (CTTT)n microsatellite aggregation ahead of the gene and a (CT)n microsatellite aggregation behind it, which possibly constitute cis- and trans-regulating regions; the (CTTT)n and (CT)n microsatellite aggregations may provide two different local stable DNA structures for regulatory protein binding.

8

Transposable elements as new players to decipher sex differences in Parkinson Disease

Gordillo-Gonzalez, F.; Galiana-Rosello, C.; Grillo-Risco, R.; Soler-Saez, I.; Hidalgo, M. R.; Siomi, H.; Kobayashi-Ishihara, M.; Garcia-Garcia, F.

2026-03-30 bioinformatics 10.64898/2026.03.27.714370 medRxiv

Top 0.1%

14.5%

We present a novel integrative analysis of transposable elements (TEs) in four single-cell RNA-seq (scRNA-seq) datasets of postmortem substantia nigra pars compacta samples from Parkinson Disease (PD) patients and matched healthy controls, with the objective of building a trustworthy cell-type-specific atlas of TEs that may clarify the role of TEs in sex differences in PD. We used the soloTE tool to evaluate TE expression changes across all snRNA-seq studies identified in our previous systematic review, and then integrated the results using meta-analysis techniques. Finally, we evaluated possible associations between TEs and protein-coding genes by integrating our previous results on this matter with the TE information obtained, in order to propose possible mechanisms of action by which some TEs contribute to PD.

9

Suidae iPSC-derived macrophages as models for investigating susceptibility and resilience to African swine fever virus

Watson, T. M.; Goatley, L. C.; Meek, S.; Eory, L.; Kohler, S.; Berkley, N.; Sternberg, S.; Jackson, M.; Findlay, A.; Hoskins, I.; Girling, S.; Mee, J.; Archibald, A. L.; Grey, F.; Steinbach, F.; Crooke, H.; Netherton, C. L.; Burdon, T.

2026-04-22 developmental biology 10.64898/2026.04.22.719209 medRxiv

Top 0.2%

12.6%

African swine fever virus (ASFV) causes a lethal haemorrhagic fever in pigs and spread of this disease threatens many pig species (Suidae) globally. By contrast, ASFV infections in the naturally evolved hosts, the warthog and bushpig, are subclinical. The macrophage (Mφ) is the primary target of ASFV and species-dependent responses in Mφs are presumed to influence disease susceptibility. In an attempt to model these differences in vitro, we generated transgene-regulated induced pluripotent stem cells (iPSCs) from domestic pig, wild boar, red river hog and warthog, and confirmed that their corresponding iPSC-derived Mφs (iPSCdMs) supported infection and replication of ASFV. In contrast to the other species, however, warthog iPSCdMs did not induce interferon upon infection by either virulent or attenuated ASFV. iPSCdMs may therefore represent an experimental system to understand how ASFV infection of Mφs contributes to disease and aid development of strategies to combat this economically and environmentally devastating pathogen.

10

Novel Prion Protein Gene (PRNP) Variants in Wild Montana Mule Deer

Seerley, A. L.; Rothfuss, M. T.; Gray, B. M.; Sebogo, M. A.; Manakelew, B. A.; Pounder, J. I.; Bowler, B. E.; Leavens, M. J.; Grindeland Panter, A. L.

2026-03-19 genetics 10.64898/2026.03.17.711390 medRxiv

Top 0.2%

10.9%

Chronic Wasting Disease (CWD) is a transmissible spongiform encephalopathy (TSE) of cervids (elk, deer, moose, and reindeer) that is increasing in prevalence and expanding to new geographical areas. TSEs, commonly referred to as prion diseases, are fatal neurodegenerative diseases that occur in a variety of mammals, including humans, and typically exhibit species-specific characteristics. This study reports the sequencing of the prion protein gene (PRNP) in retropharyngeal lymph node samples from 358 Montana mule deer (Odocoileus hemionus) and the identification of 36 PRNP genetic variants, many of which have not been reported previously. Further investigations tracked spatiotemporal characteristics of variants to hunting districts, year of harvest, and CWD status. PRNP polymorphisms V12F, D20G, R40Q, and S225F were examined with EmCAST computational predictions to determine the relationship between sequence and structural variations providing further insights into mechanisms affecting CWD outcomes. EmCAST predictions suggest the novel variant V12F phenotype is attributable to functional changes such as altered protein-protein interactions that might be linked to the CWD positive status of the samples. Notably, the analysis of S225F by EmCAST predicted that S225F is a neutral mutation for folded PrP and incompatible with fibril PrP, suggesting a potential structural mechanism for why this previously known variant may provide protection against CWD based on reduced fibril PrP formation. The CWD-positive samples harboring PRNP variants were examined with the prion RT-QuIC assay, including the novel variant V12F, which resulted in prion seeding activity. Author Summary: Chronic Wasting Disease (CWD) is a fatal disease of cervids, which include deer, elk, and moose. Since its discovery in 1967, CWD has spread to 36 U.S. states and four Canadian provinces, with prevalences exceeding 20% in select free-ranging populations.
With the popularity of hunting big game animals and the role of these species in the ecosystem, concerns have arisen regarding the transmission of disease to humans, as well as how to mitigate long term consequences of disease on animal populations. Given the significant risk of species spillover and the limitations of current management, innovative genetic research is essential. Our study identified novel PRNP genetic variants in Montana mule deer, cataloging their regional distribution and CWD status across several hunting seasons. By investigating the impact of these polymorphisms on protein stability and seeding activity, we provide critical insights into the genetic factors that influence disease phenotypes and transmissibility in wild cervid populations.

11

A pilot genome-wide association study of ischemic heart disease with co-occurring arterial hypertension in a Kazakh cohort

Skvortsova, L.; Yergali, K.; Zhaxylykova, A.; Begmanova, M.; Mansharipova, A.

2026-03-23 genetic and genomic medicine 10.64898/2026.03.19.26348868 medRxiv

Top 0.2%

10.6%

Genome-wide association studies (GWAS) of ischemic heart disease (IHD) remain underrepresented in Central Asian populations. We conducted a pilot GWAS of IHD with co-occurring arterial hypertension in a Kazakh cohort to identify candidate loci for future replication. A case-control GWAS was performed in 451 individuals (236 cases and 215 controls). Genotyping was conducted using the Illumina Infinium Global Screening Array-24 v3.0. Association testing was performed using logistic regression under an additive genetic model adjusted for age, sex and the first ten principal components (PC1 - PC10). Multiple testing correction was applied using the Bonferroni adjustment. As an additional analysis, knowledge-guided GWAS (KGWAS) followed by MAGMA gene-based testing was used to prioritize candidate genes. After quality control, 345,371 variants were tested. Two loci surpassed the Bonferroni-corrected genome-wide significance threshold: rs28898595 at the UGT1A locus (effect allele C; OR = 0.33, 95% CI = 0.23 - 0.49; p = 3.01 × 10⁻⁸) and rs28709059 in the intron region of the ACTR3C gene (effect allele C; OR = 0.4, 95% CI = 0.29 - 0.55; p = 4.08 × 10⁻⁸). Several additional loci showed suggestive evidence of association. In gene-level analysis, the CSMD1 gene demonstrated a significant association signal in MAGMA with both the European (p = 1.16 × 10⁻¹¹) and East Asian (p = 9.07 × 10⁻¹¹) LD reference panels. This pilot study identifies genome-wide significant loci (UGT1A, ACTR3C genes) and supports the CSMD1 gene as a prioritized candidate gene for the complex phenotype of IHD associated with co-occurring arterial hypertension in the Kazakh cohort. These findings are preliminary and require replication in larger Central Asian cohorts and further functional validation.
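The Bonferroni threshold implied by this abstract can be checked with simple arithmetic. The sketch below assumes a family-wise alpha of 0.05 divided by the 345,371 tested variants; the abstract does not state the alpha used, so that value and the variable names are assumptions. The conventional fixed genome-wide cutoff of 5 × 10⁻⁸ is included for comparison.

```python
# Assumed family-wise error rate; not stated in the abstract.
ALPHA = 0.05
# Number of variants tested after quality control (from the abstract).
N_VARIANTS = 345_371
# Conventional fixed genome-wide significance cutoff, for comparison.
CONVENTIONAL_CUTOFF = 5e-8

bonferroni_threshold = ALPHA / N_VARIANTS

# Lead variants and p-values as reported in the abstract.
reported = {
    "rs28898595 (UGT1A)": 3.01e-8,
    "rs28709059 (ACTR3C)": 4.08e-8,
}

print(f"Bonferroni threshold: {bonferroni_threshold:.3e}")
for variant, p_value in reported.items():
    passes_bonferroni = p_value < bonferroni_threshold
    passes_conventional = p_value < CONVENTIONAL_CUTOFF
    print(variant, passes_bonferroni, passes_conventional)
```

Under these assumptions both reported variants fall below either cutoff, which agrees with the abstract's description of them as genome-wide significant.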

12

Deep analysis of FANTOM CAGE data reveals hierarchical patterns of TSS co-deployment hubs and their disruption in cancers

Meduri, R.; Satish, A. L.; Singh, U.

2026-05-18 genomics 10.64898/2026.05.15.725323 medRxiv

Top 0.2%

10.6%

Selective deployment of multiple transcription start sites (TSSs) is a major regulatory feature of human transcriptomes. FANTOM CAGE data exhibit a near-universal TSS deployment parsimony which is disrupted in cancers. We have recently shown that TSS deployment is sensitive to gene function, futile upstream transcription, and cellular biosynthetic states. Patterns in FANTOM CAGE data can reveal mechanisms underlying TSS co-deployments. We propose and test the possibility that some TSSs act like epromoters, serving as co-varying hubs of transcriptional activity for multiple other promoters. Using deep analysis of CAGE data implemented through neural networks, we show that non-cancers implement transcription co-deployments through cores of epromoter-like TSSs which are generally proximal to their start codons. These TSSs show enhancer-like TFBS profiles. A comparison with cancer CAGE data shows that the concentrated epromoter core is disrupted in cancers, with multiple distal TSSs replacing the proximal TSS cores. We provide evidence that the core TSSs are rich in YY1 and CTCF binding sites and associated with genes coding for transcription factors. Our findings show that covariance of TSS deployment is sensitive to transcriptional resource cost and that a parsimonious design of TSS co-deployments depends on proximal TSSs in non-cancers, a mechanism grossly disrupted in cancers. Highlights: (1) Heterogeneous FANTOM CAGE data contain universal patterns of TSS co-deployments. (2) TSS co-deployments exhibit a parsimonious "core-covariant" scheme which is disrupted in cancers. (3) Core TSSs are enriched in transcription factor binding sites and gene functions which justify biological features of the samples. (4) The DL pipeline we present identifies the core-covariant TSS sets in an unbiased manner.

13

Integrative Identification and Characterization of PCOS-Associated lncRNAs From the Interface of Genetic Association, Transcriptomics, and Gene Structure Evolution

He, Z.; Li, Y.; Shkurat, T. P.; Butenko, E. V.; Derevyanchuk, E. G.; Lomteva, S. V.; Chen, L.; Lipovich, L.

2026-04-02 genomics 10.64898/2026.03.31.715548 medRxiv

Top 0.2%

10.5%

Background: Polycystic ovary syndrome (PCOS) is a prevalent endocrine disorder and a leading cause of female infertility, with complex genetic, metabolic, and hormonal etiologies. Long non-coding RNAs (lncRNAs) have emerged as important regulators of diverse biological processes, yet their roles in PCOS remain underexplored. Here, we identified and characterized PCOS differentially expressed gene-associated lncRNAs (PDEGAL) with an integrative approach combining expression data, genetic association, and evolutionary analysis. Methods: Thirty-three PCOS-associated protein-coding genes were obtained from our prior study, and all their nearby and overlapping lncRNAs were annotated. These candidates were analyzed using UCSC Genome Browser-mapped annotations and datasets, including NCBI RefSeq, GENCODE, GTEx, GWAS SNPs, and conservation, as well as the FANTOM5 cap analysis of gene expression (CAGE) promoter data, to assess their expression, regulatory potential, genetic variant overlaps, and evolutionary conservation. Results: Twenty-three PDEGALs (18 antisense to, and 5 sharing bidirectional promoters with, known PCOS-associated protein-coding genes) were identified. 17 PDEGALs contained GWAS SNPs with statistically significant disease associations, 9 of which were associated with PCOS-related traits. 5 PDEGALs demonstrated expression in the KGN granulosa cell model of PCOS. Key gene structure element (KGSE) analysis revealed that most PDEGALs are primate-specific. Integrating four criteria (GTEx expression, GWAS SNPs, FANTOM promoterome, and KGSE conservation) highlighted HELLPAR as the only lncRNA fulfilling all four, while five others (PGR-AS1, MTOR-AS1, ENSG00000265179, ENSG00000256218, and LOC105377276) fulfilled three of the four criteria. Conclusions: We have systematically identified candidate PCOS regulatory lncRNAs with convergent genetic, expression, and evolutionary evidence.
These results provide a framework for functional validation and highlight lncRNAs as potential biomarkers and therapeutic targets in PCOS that function by regulating their nearby and overlapping protein-coding genes.

14

Detection and evaluation of copy number variation using both linked-read and short-read sequencing in New Zealand dairy cattle

Wang, Y.; Nugroho, T.; Johnson, T. J. J.; Couldrey, C.; Harris, B. L.

2026-04-23 bioinformatics 10.64898/2026.04.20.718595 medRxiv

Top 0.2%

10.4%

In recent years, genetic studies have made significant progress in identifying single-nucleotide polymorphisms (SNPs) associated with cattle health and production traits. However, it is still challenging to identify and validate more complicated forms of variation, such as copy number variation (CNV) and other types of structural variation (SV). In this study, SV regions were identified using 37 New Zealand dairy cattle with linked-read sequence data. A transmission-based framework was used to validate these variants at the population scale. 62,438 putative autosomal SV regions were identified with the LongRanger pipeline following the 10x Genomics recommendations. Copy number states for these regions were subsequently estimated via a read-depth based genotyping method using CNVpytor in a population-representative cohort of 2306 animals using Illumina short-read sequencing technology. Mendelian inheritance of copy number states was assessed using linear mixed models incorporating pedigree information, and transmission levels were used to quantify the biological validity of each CNV region. Transmission levels ranged widely, with a mean of 0.5162 across all regions, where higher transmission levels were proportionally enriched for larger SVs. A total of 7218 CNV regions exhibited high transmission levels (>0.9), indicating strong evidence of inheritance. Among these, 7136 overlapped CNV regions reported in one or more public datasets, while 82 high-confidence regions represent previously unreported variants. High-transmission CNV regions tended to show clear, discrete inheritance patterns in trio families, providing the biological evidence that these CNVs are inherited within the population. Together, these results demonstrate that integrating linked-read sequencing with population-scale transmission-based validation provides a robust framework for identifying high-confidence CNV regions. 
This catalogue of validated CNV regions represents an important resource for downstream functional analyses and the incorporation of structural variation into genomic selection and breeding programs.

15

Genomic Variability of the HCT116 Cell Line Identified Using Oxford Nanopore Sequencing

Leonov, P.; Mikheeva, R.; Koryukov, M.; Ruleva, E.; Karabut, E.; Kechin, A.

2026-04-24 genomics 10.64898/2026.04.23.720331 medRxiv

Top 0.3%

10.2%

HCT116 is a colorectal cancer cell line frequently used in anti-tumor drug development experiments as well as in studies of the molecular machinery of eukaryotic cells. It is well characterized by the presence of several single-nucleotide and short mutations in multiple oncogenes and tumor suppressor genes, including KRAS, PIK3CA, MLH1, CTNNB1, CDKN2A, TGFBR2, and BRCA2. However, its landscape of large genomic rearrangements (LGRs) and copy number variants (CNVs) is still far from being fully understood. Therefore, the aim of this study was to identify LGRs and CNVs in several HCT116 cell line samples using Oxford Nanopore sequencing technology, including three samples from the SRA NCBI database, and to compare common and unique variants across all samples. Using the recently developed eLaRodON tool, we identified 22,666 common LGRs, among which more than 70% of tandem duplications and deletions larger than 80 kb were confirmed by CNV analysis. Among LGRs affecting protein-coding sequences, two in-frame rearrangements were identified: a deletion of exons 4-6 and a duplication of exon 10 in the CCSER1 gene, which encodes a cell division regulator protein. Given its high rearrangement rate in various tumors and the clinical significance of its overexpression, this finding may be potentially useful in future research on this cell line. Regarding differences between samples, we found that LGRs in the laboratory sample and in one of the three SRA NCBI samples occurred more frequently via ALR/Alpha repeats than via Alu repeats, in contrast to common LGRs and those unique to the other samples, a finding that may indicate the presence of unique mechanisms of genomic instability. Thus, this study reveals a broad spectrum of large genomic rearrangements and copy number variants that can be identified in the HCT116 cell line using Oxford Nanopore sequencing, including rearrangements specific to distinct cell line samples.

16

Comparing bulk and single-cell methodologies and models to profile gene expression, chromatin accessibility and regulatory links in endothelial cells treated with TNFα

Zevounou, J.; Lo, K. S.; McGinnis, C. S.; Satpathy, A. T.; Lettre, G.

2026-03-16 genomics 10.64898/2026.03.13.711357 medRxiv

Top 0.3%

10.1%

Genome-wide association studies (GWAS) have identified thousands of non-coding variants associated with complex traits and diseases. However, it remains challenging to pinpoint the causal genes that are regulated by associated genetic variants. Connecting causal non-coding variants with genes can rely on methods that identify direct physical interactions (e.g. chromosome conformation capture) or on probabilistic models that predict regulatory links. These statistical models take advantage of gene expression and chromatin accessibility profiles generated in cells and tissues by bulk or single-cell (sc) methodologies. Here, we tested whether the choice of bulk or sc RNAseq/ATACseq data, and of the corresponding predictive enhancer-to-gene models, affects the prioritization of causal GWAS genes. Using non-treated and TNF-treated human endothelial cells in vitro as a well-controlled experimental system, we show that bulk and sc RNAseq/ATACseq profiles are similar and highlight the same biology (e.g. biological pathways). Despite these similarities, we show using GWAS results for coronary artery disease (CAD) and diastolic blood pressure that applying enhancer-to-gene models designed for bulk or sc methodologies can yield differences in terms of captured heritability, fine-mapped variants and linked genes. For instance, at one CAD locus, the bulk-based ABC model predicts a regulatory link with BCAR1, whereas the sc-based model scE2G prioritizes a different gene (CFDP1). Our results indicate that, even within the same experimental model, choosing between a bulk or sc approach will influence regulatory link model predictions; this should be considered when planning functional experiments to characterize GWAS discoveries.

17

Evaluation of the Contribution of Natural Selection to Greater Cardiometabolic Disease Risk in South Asian Populations

Searby, D. J. C.; Hemani, G.; Chong, A.; Lawson, D. J.; Chaturvedi, N. J.; Davey Smith, G.

2026-05-22 genetic and genomic medicine 10.64898/2026.05.15.26353234 medRxiv

Top 0.3%

10.1%

Greater genetic susceptibility has been proposed as an explanation for the higher rates of cardiovascular and metabolic disease in South Asian relative to European populations. We first demonstrate that, after accounting for technical artefacts, the genetic effects for related traits are largely consistent between ancestral groups, which downplays the role of GxG or GxE interactions in driving differential prevalence. If higher genetic susceptibility in South Asians is due to selective pressures acting through adiposity-related traits in the evolutionary past, signatures of selection should be evident at loci associated with cardiometabolic disease and other causally related traits (e.g. fat distribution). We tested for enrichment of several selection statistics (FST, XP-EHH and XP-nSL) at loci associated with a range of traits related to cardiometabolic disease, in comparison to a null distribution of SNPs matched on linkage disequilibrium (LD) score and minor allele frequency (MAF). Loci associated with a subset of these traits (Type 2 diabetes mellitus, trunk fat percentage, body fat percentage and trunk fat mass) exhibited enrichment for FST, consistent with a moderate adaptive explanation for their cross-population differentiation. In contrast, none of the studied traits were enriched for haplotype-based statistics, indicating that cross-population genetic divergence is unlikely to have been driven by recent selective sweeps and has more likely arisen from either ancient selection or recent polygenic selection acting on standing variation.
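
The comparison of trait-associated loci against a matched null distribution can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name, the resampling scheme and the assumption that `background_values` already holds selection statistics for SNPs matched on LD score and MAF are all hypothetical choices made for illustration.

```python
import random
from statistics import mean


def empirical_enrichment_p(trait_values, background_values, n_draws=10_000, seed=0):
    """One-sided empirical p-value for enrichment of a selection statistic.

    trait_values: statistic (e.g. FST) at trait-associated loci.
    background_values: statistic at SNPs assumed to be already matched
        on LD score and MAF (hypothetical input; the matching step is not shown).
    """
    rng = random.Random(seed)
    observed = mean(trait_values)
    k = len(trait_values)
    hits = 0
    for _ in range(n_draws):
        # Draw a null set of the same size as the trait set.
        null_mean = mean(rng.sample(background_values, k))
        if null_mean >= observed:
            hits += 1
    # Add-one correction so the p-value is never exactly zero.
    return (hits + 1) / (n_draws + 1)
```

A small p-value here would mean that the trait loci carry higher values of the statistic than most matched random sets, which is the sense of "enrichment" used in the abstract.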

18

Genome-wide identification of rhabdoviral sequences in alfalfa (Medicago sativa L.)

Grinstead, S.; Nemchinov, L. G.

2026-05-22 genomics 10.64898/2026.05.20.726541 medRxiv

Top 0.3%

10.1%

We recently reported the identification of endogenous viral elements (EVEs) originating from the Caulimoviridae family within the alfalfa (Medicago sativa L.) genome. Our subsequent identification of ubiquitous rhabdoviral elements in infected and healthy alfalfa tissues by high-throughput sequencing prompted us to suggest that the alfalfa genome might be populated with integrated rhabdoviruses as well. Bioinformatics analysis of 26 publicly available alfalfa genomes confirmed this suggestion. We found multiple non-retroviral segments of the Rhabdoviridae family, belonging to the genera Betanucleorhabdovirus and Betacytorhabdovirus, that appeared to be stable constituents of the host genome. In that capacity they could potentially acquire functional roles in alfalfa's development and response to environmental stresses. We believe this study reveals the first documented case of rhabdoviruses integrated into the alfalfa genome.

19

Heat Stress Induces Locus-Specific DNA Hypomethylation Linked to Immune Regulation in Lactating Holstein Cows

Costa Monteiro Moreira, G.; Ruiz Gonzalez, A.; Joigner, M.; Costes, V.; Chaulot-Talmon, A.; Ali, F.; Bourgeois-Brunel, L.; Jammes, H.; Rico, D. E.

2026-03-26 genomics 10.64898/2026.03.23.713208 medRxiv

Top 0.3%

10.0%

Epigenetics may play a crucial role in livestock adaptation to environmental challenges such as heat stress. In recent years, a growing number of studies have investigated the epigenetic mechanisms underlying dairy cow adaptation to heat stress. However, knowledge of the effects of heat stress on immune cells and immune-related phenotypes remains limited. Here we aim to identify heat stress-induced DNA methylation changes in the blood methylome that potentially affect regulatory regions and associated phenotypes. Blood samples were collected and peripheral blood mononuclear cells (PBMCs) were isolated from four cows before (D0) and after (D14) a 14-d heat stress challenge (cyclical temperature-humidity index [THI] 72-82), and from four cows kept in thermoneutral conditions (THI 61-64). Heat-stressed cows had ad libitum access to diets supplemented with adequate levels of vitamin D and Ca (12,000 IU/kg and 0.73%, respectively). To eliminate confounding effects due to differences in nutrient intake, cows maintained under thermoneutral conditions were pair-fed (PF) to their heat-stressed counterparts and also received adequate concentrations of vitamin D and Ca. Reduced representation bisulphite sequencing (RRBS) was used to profile the PBMC methylome. Differential methylation analysis was performed using the methylKit and DSS software packages (Δmeth ≥ 25%, adjusted p-value < 0.01), retaining only differentially methylated cytosines (DMCs) detected by both. A total of 2,908 DMCs were identified when comparing pre- and post-heat stress samples. After excluding 649 DMCs that were also detected under thermoneutral conditions, as these changes were likely associated with the feed restriction inherent to the pair-feeding design rather than with heat stress per se, 2,259 heat stress-specific DMCs remained, predominantly hypomethylated. About half of the DMCs are annotated in intronic and intergenic regions, which are known to harbor regulatory elements. By intersecting the DMRs with publicly available functional annotation data, we observed hypomethylation in regulatory regions putatively affecting the cows' immune system. As an example, we identified a loss of methylation within an enhancer region of the MSN gene, which is involved in lymphocyte homeostasis, and a loss of methylation in the promoter region of MECP2, a well-established epigenetic regulator with a central role in chromatin organization and gene expression. These findings highlight the impact of heat stress on dairy cow immunity and provide new insights into its epigenetic regulation under environmental stress.

Interpretative summary: This study examined DNA methylation changes induced by heat stress in dairy cows to elucidate epigenetic mechanisms of thermal adaptation. Using RRBS on PBMCs, 2,259 heat stress-specific differentially methylated cytosines were identified, predominantly hypomethylated and enriched in regulatory regions. Functional annotation highlighted immune-related pathways, including hypomethylated regulatory regions near genes (e.g., MSN, ZBTB33, SLC25A5, GNAS, FAM3A, and MECP2) associated with immune function. These findings indicate that heat stress induces targeted epigenetic modifications potentially affecting immune regulation in dairy cows.
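
The filtering logic described in the abstract (threshold each tool's results, keep sites called by both tools, then drop sites that also change under thermoneutral pair-feeding) can be sketched with plain set operations. This is an illustrative sketch only: the function names and the input layout are hypothetical, and treating the 25% threshold as an absolute difference and applying the same thresholds to the thermoneutral comparison are assumptions not stated in the abstract.

```python
def significant_sites(results, min_delta=25.0, max_padj=0.01):
    """Return the sites passing both thresholds.

    results: dict mapping a site identifier to a tuple
        (methylation difference in percent, adjusted p-value).
    Assumption: the difference threshold is applied to the absolute value.
    """
    return {
        site
        for site, (delta, padj) in results.items()
        if abs(delta) >= min_delta and padj < max_padj
    }


def heat_specific_dmcs(methylkit_hs, dss_hs, methylkit_tn, dss_tn):
    """DMCs called by both tools under heat stress, minus those also
    called by both tools under thermoneutral (pair-fed) conditions."""
    heat = significant_sites(methylkit_hs) & significant_sites(dss_hs)
    thermoneutral = significant_sites(methylkit_tn) & significant_sites(dss_tn)
    return heat - thermoneutral
```

The counts reported in the abstract follow the same subtraction: 2,908 DMCs before exclusion minus 649 shared with the thermoneutral comparison leaves 2,259 heat stress-specific DMCs.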

20

Genetic Characterization of the TAPBP and Its Haplotypic Association with BF2 in the Chicken Major Histocompatibility Complex

Fernando, R.; Agulto, T. N.; Cho, E.; Kim, J.; van Hateren, A.; Kim, M.; Prabuddha, M.; Lee, J. H.

2026-04-23 genetics 10.64898/2026.04.20.719781 medRxiv

Top 0.4%

9.4%

TAPBP is a key chaperone of the peptide-loading complex that facilitates peptide loading onto major histocompatibility complex class I (MHC I) molecules. This study characterized TAPBP alleles in Korean Native Chickens (KNCs), identified novel variants, and evaluated haplotypic associations with BF2. Thirty-six samples representing six KNC lines were genotyped using LEI0258 and the MHC-B SNP panel, and individuals homozygous at both markers were classified into 16 groups. The same samples were subjected to Sanger sequencing of TAPBP exons 3-8. Sequences were assembled and aligned against MHC-B reference haplotypes and the Red Junglefowl reference. Additional comparisons with "tapasin allele" datasets enabled the identification of novel variants. Six novel nucleotide variants were detected across exons 3-6, including one nonsynonymous substitution in exon 4 (D251H). This residue corresponds to position Q265 in human TAPBP and lies adjacent to residues involved in MHC I interaction, suggesting potential functional relevance. Furthermore, TAPBP exhibited high haplotype diversity (Hd = 0.93) and moderate nucleotide diversity (π = 0.00892), with exon 5 showing the highest diversity (π = 0.01). B9 was the most frequent haplotype at the nucleotide level, whereas B6/B24 predominated at the amino acid level. Comparison with BF2 data revealed haplotype-dependent pairing patterns: BF2-B9 consistently matched TAPBP-B9, whereas BF2-B6 was associated with distinct TAPBP nucleotide variants, indicating allelic diversification within a shared haplotypic background. Homozygosity at LEI0258 and the SNP panel corresponded with TAPBP homozygosity, supporting marker-based prediction. These findings highlight potential BF2-TAPBP associations and provide a foundation for understanding variation in MHC I peptide loading.
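
For readers unfamiliar with the two diversity measures quoted above, the following sketch computes them from a list of aligned sequences using the standard definitions (Nei's haplotype diversity with the n/(n-1) sample correction, and nucleotide diversity as the mean number of pairwise differences per site). The abstract does not say which software or estimator the authors used, so this is an assumption, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def haplotype_diversity(haplotypes):
    """Hd = n / (n - 1) * (1 - sum of squared haplotype frequencies).

    haplotypes: one haplotype label or sequence per sampled chromosome
        (requires at least two samples).
    """
    n = len(haplotypes)
    counts = Counter(haplotypes)
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1 - sum_sq)


def nucleotide_diversity(sequences):
    """Mean pairwise differences per site across all pairs of sequences.

    sequences: aligned, equal-length strings with no gaps
        (gap and missing-data handling is left out of this sketch).
    """
    length = len(sequences[0])
    pairs = list(combinations(sequences, 2))
    diffs = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return diffs / (len(pairs) * length)
```

An Hd close to 1, as in the reported 0.93, means that two randomly chosen sequences are very likely to be different haplotypes, while π measures how different they are per site on average.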